The palmerpenguins data contains size measurements for
three penguin species observed on three islands in the Palmer
Archipelago, Antarctica.
These data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal, and are available for use under a CC0 license (“No Rights Reserved”) in accordance with the Palmer Station Data Policy.
You can install the released version of palmerpenguins from CRAN with:
install.packages("palmerpenguins")
Or install the development version from GitHub with:
# install.packages("remotes")
remotes::install_github("allisonhorst/palmerpenguins")
This package contains two datasets:
Here, we’ll focus on a curated subset of the raw data in the
package named penguins.
The raw data, accessed from the Environmental Data Initiative (see
full data citations below), is also available as
palmerpenguins::penguins_raw.
The curated palmerpenguins::penguins dataset contains 8
variables (n = 344 penguins). You can read more about the variables by
typing ?penguins.
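library(palmerpenguins)
library(dplyr)    # glimpse(), %>%, and the data verbs used below
library(ggplot2)  # plotting functions and remove_missing()
glimpse(penguins)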
#> Rows: 344
#> Columns: 8
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex <fct> male, female, female, NA, female, male, female, male…
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
The palmerpenguins::penguins data contains 333 complete
cases, with 19 missing values.
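As a quick check of those counts, here is a minimal sketch using base R functions. The variable names (`n_complete`, `n_missing`, `missing_by_column`) are introduced here only for illustration.

```r
library(palmerpenguins)

# Rows with no missing value in any column
n_complete <- sum(complete.cases(penguins))

# Total number of NA cells across the whole data frame
n_missing <- sum(is.na(penguins))

# Number of NA cells in each column
missing_by_column <- colSums(is.na(penguins))

n_complete
n_missing
missing_by_column
```

The per-column breakdown helps when deciding whether to drop incomplete rows entirely or to handle missing values one variable at a time.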
Challenge! Let’s find the smallest penguin observed in each species.
penguins %>%
  group_by(species) %>%
  filter(body_mass_g == min(body_mass_g, na.rm = TRUE))
#> # A tibble: 4 × 8
#> # Groups: species [3]
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Biscoe 36.5 16.6 181 2850
#> 2 Adelie Biscoe 36.4 17.1 184 2850
#> 3 Gentoo Biscoe 42.7 13.7 208 3950
#> 4 Chinstrap Dream 46.9 16.6 192 2700
#> # ℹ 2 more variables: sex <fct>, year <int>
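The same question can also be answered with `slice_min()`, which keeps tied rows by default. This is a sketch assuming dplyr 1.0.0 or later; `smallest` is just an illustrative name.

```r
library(palmerpenguins)
library(dplyr)

# For each species, keep the row(s) with the lowest body mass.
# with_ties = TRUE keeps every penguin that shares the minimum.
smallest <- penguins %>%
  group_by(species) %>%
  slice_min(body_mass_g, n = 1, with_ties = TRUE) %>%
  ungroup()

smallest %>%
  select(species, island, body_mass_g)
```

Because ties are kept, the result can have more rows than there are species, as with the two Adelie penguins above.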
Practice mutating – let’s create a new column that has bill size (area, in square millimeters).
penguins %>%
  mutate(bill_size_mm2 = bill_depth_mm * bill_length_mm) %>%
  head()
#> # A tibble: 6 × 9
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750
#> 2 Adelie Torgersen 39.5 17.4 186 3800
#> 3 Adelie Torgersen 40.3 18 195 3250
#> 4 Adelie Torgersen NA NA NA NA
#> 5 Adelie Torgersen 36.7 19.3 193 3450
#> 6 Adelie Torgersen 39.3 20.6 190 3650
#> # ℹ 3 more variables: sex <fct>, year <int>, bill_size_mm2 <dbl>
Let’s select all columns that contain measurements in mm.
penguins %>%
  select(ends_with("mm"))
#> # A tibble: 344 × 3
#> bill_length_mm bill_depth_mm flipper_length_mm
#> <dbl> <dbl> <int>
#> 1 39.1 18.7 181
#> 2 39.5 17.4 186
#> 3 40.3 18 195
#> 4 NA NA NA
#> 5 36.7 19.3 193
#> 6 39.3 20.6 190
#> 7 38.9 17.8 181
#> 8 39.2 19.6 195
#> 9 34.1 18.1 193
#> 10 42 20.2 190
#> # ℹ 334 more rows
Let’s select the same columns another way, this time matching any column name that contains “mm”.
penguins %>%
  select(contains("mm"))
#> # A tibble: 344 × 3
#> bill_length_mm bill_depth_mm flipper_length_mm
#> <dbl> <dbl> <int>
#> 1 39.1 18.7 181
#> 2 39.5 17.4 186
#> 3 40.3 18 195
#> 4 NA NA NA
#> 5 36.7 19.3 193
#> 6 39.3 20.6 190
#> 7 38.9 17.8 181
#> 8 39.2 19.6 195
#> 9 34.1 18.1 193
#> 10 42 20.2 190
#> # ℹ 334 more rows
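The two helpers return the same columns for penguins, but they are not interchangeable in general. The sketch below uses a hypothetical toy table (`toy` and its column names are made up for illustration) to show where they differ.

```r
library(dplyr)

# Hypothetical toy table; only the column names matter here
toy <- tibble(
  bill_length_mm = 39.1,
  mm_per_pixel   = 0.2,
  comment        = "ok"
)

# ends_with() matches only names that finish with "mm"
ends_names <- names(select(toy, ends_with("mm")))

# contains() matches "mm" anywhere in the name, including inside "comment"
contains_names <- names(select(toy, contains("mm")))

ends_names
contains_names
```

For unit suffixes, `ends_with("_mm")` is the stricter choice, since it cannot pick up names that merely happen to include the letters.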
Let’s find the median body mass for each species (using
mutate()).
penguins %>%
  remove_missing() %>%
  group_by(species) %>%
  mutate(body_mass_median = median(body_mass_g))
#> # A tibble: 333 × 9
#> # Groups: species [3]
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750
#> 2 Adelie Torgersen 39.5 17.4 186 3800
#> 3 Adelie Torgersen 40.3 18 195 3250
#> 4 Adelie Torgersen 36.7 19.3 193 3450
#> 5 Adelie Torgersen 39.3 20.6 190 3650
#> 6 Adelie Torgersen 38.9 17.8 181 3625
#> 7 Adelie Torgersen 39.2 19.6 195 4675
#> 8 Adelie Torgersen 41.1 17.6 182 3200
#> 9 Adelie Torgersen 38.6 21.2 191 3800
#> 10 Adelie Torgersen 34.6 21.1 198 4400
#> # ℹ 323 more rows
#> # ℹ 3 more variables: sex <fct>, year <int>, body_mass_median <dbl>
Let’s find the median body mass for each species (using
summarize()).
penguins %>%
  remove_missing() %>%
  group_by(species) %>%
  summarize(body_mass_median = median(body_mass_g))
#> # A tibble: 3 × 2
#> species body_mass_median
#> <fct> <dbl>
#> 1 Adelie 3700
#> 2 Chinstrap 3700
#> 3 Gentoo 5050
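Note that `remove_missing()` drops every row with a missing value in any column, including rows where only sex is missing. A possible alternative, sketched below, keeps all rows and skips only the missing body masses; `mass_medians` and `n_measured` are illustrative names, and the medians may differ slightly from the table above because more penguins are included.

```r
library(palmerpenguins)
library(dplyr)

mass_medians <- penguins %>%
  group_by(species) %>%
  summarize(
    # how many penguins actually have a body mass recorded
    n_measured = sum(!is.na(body_mass_g)),
    # median over the recorded values only
    body_mass_median = median(body_mass_g, na.rm = TRUE)
  )

mass_medians
```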
Let’s find the median of everything! This time, we also group by year.
penguins %>%
  remove_missing() %>%
  group_by(species, year) %>%
  summarize(across(where(is.numeric), median))
#> # A tibble: 9 × 6
#> # Groups: species [3]
#> species year bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 Adelie 2007 39 18.6 186 3675
#> 2 Adelie 2008 38.6 18.3 190 3700
#> 3 Adelie 2009 38.7 18.0 191 3600
#> 4 Chinstrap 2007 48.8 18.2 194. 3700
#> 5 Chinstrap 2008 49.2 18.5 198. 3750
#> 6 Chinstrap 2009 50.0 18.6 198 3675
#> 7 Gentoo 2007 46.7 14.6 215 5050
#> 8 Gentoo 2008 46.4 15 219 5000
#> 9 Gentoo 2009 48.8 15.2 218 5200
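The `# Groups: species [3]` line in the output shows that the result is still grouped by species, because `summarize()` drops only the last grouping variable by default. If an ungrouped result is preferred, one option is the `.groups` argument, as in this sketch (`yearly_medians` is an illustrative name, and `na.rm = TRUE` is used here in place of `remove_missing()`, so some values may differ from the table above).

```r
library(palmerpenguins)
library(dplyr)

yearly_medians <- penguins %>%
  group_by(species, year) %>%
  summarize(
    across(where(is.numeric), ~ median(.x, na.rm = TRUE)),
    .groups = "drop"  # return a plain, ungrouped tibble
  )

# FALSE when no grouping remains on the result
is_grouped_df(yearly_medians)
```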
Let’s create a new column that classifies bill size into two categories – big or small.
threshold <- 800  # first define a threshold to distinguish big from small
penguins %>%
  mutate(bill_size_mm2 = bill_depth_mm * bill_length_mm,
         bill_size_binary = ifelse(bill_size_mm2 > threshold, "big", "small")) %>%
  select(bill_size_binary, bill_size_mm2, everything()) %>%
  head()
#> # A tibble: 6 × 10
#> bill_size_binary bill_size_mm2 species island bill_length_mm bill_depth_mm
#> <chr> <dbl> <fct> <fct> <dbl> <dbl>
#> 1 small 731. Adelie Torgersen 39.1 18.7
#> 2 small 687. Adelie Torgersen 39.5 17.4
#> 3 small 725. Adelie Torgersen 40.3 18
#> 4 <NA> NA Adelie Torgersen NA NA
#> 5 small 708. Adelie Torgersen 36.7 19.3
#> 6 big 810. Adelie Torgersen 39.3 20.6
#> # ℹ 4 more variables: flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
#> # year <int>
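In the output above, penguins without bill measurements get `NA` as their category. If an explicit label is more convenient, `case_when()` is one way to handle that, as in this sketch; the `"unknown"` label and the names `classified` and `bill_size_class` are assumptions made for illustration.

```r
library(palmerpenguins)
library(dplyr)

threshold <- 800  # same cut-off as above, in square millimeters

classified <- penguins %>%
  mutate(
    bill_size_mm2 = bill_depth_mm * bill_length_mm,
    # conditions are checked in order; the first match wins
    bill_size_class = case_when(
      is.na(bill_size_mm2)      ~ "unknown",
      bill_size_mm2 > threshold ~ "big",
      TRUE                      ~ "small"
    )
  )

classified %>%
  count(bill_size_class)
```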
The penguins data has three factor variables:
penguins %>%
  dplyr::select(where(is.factor)) %>%
  glimpse()
#> Rows: 344
#> Columns: 3
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie…
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
#> $ sex <fct> male, female, female, NA, female, male, female, male, NA, NA, …
# Count penguins for each species / island
penguins %>%
  count(species, island, .drop = FALSE)
#> # A tibble: 9 × 3
#> species island n
#> <fct> <fct> <int>
#> 1 Adelie Biscoe 44
#> 2 Adelie Dream 56
#> 3 Adelie Torgersen 52
#> 4 Chinstrap Biscoe 0
#> 5 Chinstrap Dream 68
#> 6 Chinstrap Torgersen 0
#> 7 Gentoo Biscoe 124
#> 8 Gentoo Dream 0
#> 9 Gentoo Torgersen 0
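The `.drop = FALSE` argument is what keeps the zero-count rows in the table above. The sketch below compares it with the default behavior; `observed_only` and `all_combinations` are illustrative names.

```r
library(palmerpenguins)
library(dplyr)

# Default: only species / island combinations that occur in the data
observed_only <- penguins %>%
  count(species, island)

# .drop = FALSE: every combination of factor levels, with n = 0 where empty
all_combinations <- penguins %>%
  count(species, island, .drop = FALSE)

nrow(observed_only)
nrow(all_combinations)
```

Keeping the empty combinations makes it explicit that, for example, no Chinstrap penguins were observed on Biscoe or Torgersen.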
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(alpha = 0.8) +
  scale_fill_manual(values = c("darkorange","purple","cyan4"),
                    guide = "none") +  # guide = FALSE is deprecated in recent ggplot2
  theme_minimal() +
  facet_wrap(~species, ncol = 1) +
  coord_flip()
# Count penguins for each species / sex
penguins %>%
  count(species, sex, .drop = FALSE)
#> # A tibble: 8 × 3
#> species sex n
#> <fct> <fct> <int>
#> 1 Adelie female 73
#> 2 Adelie male 73
#> 3 Adelie <NA> 6
#> 4 Chinstrap female 34
#> 5 Chinstrap male 34
#> 6 Gentoo female 58
#> 7 Gentoo male 61
#> 8 Gentoo <NA> 5
ggplot(penguins, aes(x = sex, fill = species)) +
  geom_bar(alpha = 0.8) +
  scale_fill_manual(values = c("darkorange","purple","cyan4"),
                    guide = "none") +  # guide = FALSE is deprecated in recent ggplot2
  theme_minimal() +
  facet_wrap(~species, ncol = 1) +
  coord_flip()
# Penguins are fun to summarize!
penguins %>%
  count(species)
#> # A tibble: 3 × 2
#> species n
#> <fct> <int>
#> 1 Adelie 152
#> 2 Chinstrap 68
#> 3 Gentoo 124
penguins %>%
  group_by(species) %>%
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
#> # A tibble: 3 × 6
#> species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Adelie 38.8 18.3 190. 3701. 2008.
#> 2 Chinstrap 48.8 18.4 196. 3733. 2008.
#> 3 Gentoo 47.5 15.0 217. 5076. 2008.
penguins %>%
  dplyr::select(body_mass_g, ends_with("_mm")) %>%
  glimpse()
#> Rows: 344
#> Columns: 4
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
# Scatterplot example 1: penguin flipper length versus body mass
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species,
                 shape = species),
             size = 2) +
  scale_color_manual(values = c("darkorange","darkorchid","cyan4"))
# Scatterplot example 2: penguin bill length versus bill depth
ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species,
                 shape = species),
             size = 2) +
  scale_color_manual(values = c("darkorange","darkorchid","cyan4"))
You can add color and/or shape aesthetics in ggplot2 to
layer in factor levels like we did above. With three factor variables to
work with, you can add another factor layer with facets, like the plot
below.
ggplot(penguins, aes(x = flipper_length_mm,
                     y = body_mass_g)) +
  geom_point(aes(color = sex)) +
  scale_color_manual(values = c("darkorange","cyan4"),
                     na.translate = FALSE) +
  facet_wrap(~species)
The culmen is the upper ridge of a bird’s bill. In the simplified
penguins data, culmen length and depth are renamed as
variables bill_length_mm and bill_depth_mm to
be more intuitive.
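To see the original names, the raw data can be inspected directly. The sketch below assumes the raw column is called `Culmen Length (mm)` and that both datasets keep their rows in the same order; `culmen_columns` is an illustrative name.

```r
library(palmerpenguins)

# Column names in the raw data that hold the culmen measurements
culmen_columns <- grep("Culmen", names(penguins_raw), value = TRUE)
culmen_columns

# Compare the raw culmen length with the renamed column in penguins;
# this comparison is only meaningful if the row order is the same
all.equal(penguins_raw[["Culmen Length (mm)"]], penguins$bill_length_mm)
```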
For this penguin data, the culmen (bill) length and depth are measured as shown below (thanks Kristen Gorman for clarifying!):
# Jitter plot example: bill length by species
ggplot(data = penguins, aes(x = species, y = bill_length_mm)) +
  geom_jitter(aes(color = species),
              width = 0.1,
              alpha = 0.7,
              show.legend = FALSE) +
  scale_color_manual(values = c("darkorange","darkorchid","cyan4"))
# Histogram example: flipper length by species
ggplot(data = penguins, aes(x = flipper_length_mm)) +
  geom_histogram(aes(fill = species), alpha = 0.5, position = "identity") +
  scale_fill_manual(values = c("darkorange","darkorchid","cyan4"))
Data originally published in:
Individual datasets:
Individual data can be accessed directly via the Environmental Data Initiative:
Palmer Station Antarctica LTER and K. Gorman, 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Adélie penguins (Pygoscelis adeliae) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5. Environmental Data Initiative. https://doi.org/10.6073/pasta/98b16d7d563f265cb52372c8ca99e60f (Accessed 2020-06-08).
Palmer Station Antarctica LTER and K. Gorman, 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Gentoo penguin (Pygoscelis papua) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5. Environmental Data Initiative. https://doi.org/10.6073/pasta/7fca67fb28d56ee2ffa3d9370ebda689 (Accessed 2020-06-08).
Palmer Station Antarctica LTER and K. Gorman, 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Chinstrap penguin (Pygoscelis antarcticus) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 6. Environmental Data Initiative. https://doi.org/10.6073/pasta/c14dfcfada8ea13a17536e73eb6fbe9e (Accessed 2020-06-08).